Human Genetics and Genomics Advances — Latest Matching Preprints

1

Data Simulation to Optimize the GWAS Framework in Diverse Populations

Mugo, J. W.; Chimusa, E. R.; Mulder, N. J.

2023-10-26 genetic and genomic medicine 10.1101/2023.10.26.23297606 medRxiv

Top 0.1%

25.6%

Whole-genome or genome-wide association studies have become a fundamental part of modern genetic studies and methods for dissecting the genetic architecture of common traits based on common polymorphisms in random populations. It is hoped that there will be many potential uses of these identified variants, including a better understanding of the pathogenesis of traits, the discovery of biomarkers and protein targets, and the clinical prediction of drug treatments for global health. Questions have been raised about whether associations largely discovered in populations of European descent are replicable in diverse populations and can inform medical decision-making globally, and about how efficiently current GWAS tools perform in populations of high genetic diversity, multi-wave genetic admixture, and low linkage disequilibrium (LD), such as African populations. In this study, we employ genomic data simulation to mimic structured African, European, and multi-way admixed populations to evaluate the replicability of association signals from current state-of-the-art GWAS tools in these populations. We then leverage the results to discuss an optimized framework for the analysis of GWAS data in diverse populations and outline the implications, challenges, and opportunities these studies present for populations of non-European descent.

2

Improving imputation quality in Samoans through the integration of population-specific sequences into existing reference panels

Carlson, J. C.; Krishnan, M.; Liu, S.; Anderson, K. J.; Zhang, J. Z.; Yapp, T.-A. J.; Chiyka, E. A.; Dikec, D. A.; Cheng, H.; Naseri, T.; Reupena, M. S.; Viali, S.; Deka, R.; Hawley, N. L.; McGarvey, S. T.; Weeks, D. E.; Minster, R. L.

2023-10-31 genetic and genomic medicine 10.1101/2023.10.31.23297835 medRxiv

Top 0.1%

19.2%

Genotype imputation is fundamental to association studies, and yet even gold standard panels like TOPMed are limited in the populations for which they yield good imputation. Specifically, Pacific Islanders are poorly represented in extant panels. To address this, we constructed an imputation reference panel using 1,285 Samoan individuals with whole-genome sequencing, combined with 1000 Genomes Project (1KGP) individuals, to create a reference panel that better represents Pacific Islander, specifically Samoan, genetic variation. We compared this panel to 1KGP and TOPMed-R3 panels based on imputed variants using genotyping array data for 1,834 Samoan participants who were not part of the panels. The 1KGP + 1285 Samoan panel yielded up to two times more well-imputed (r² ≥ 0.80) variants than TOPMed-R3 and 1KGP and was enriched for moderate and high impact variants. There was improved imputation accuracy across the minor allele frequency (MAF) spectrum, although it was most pronounced for variants with 0.01 ≤ MAF ≤ 0.05. Imputation accuracy (r²) was greater for population-specific variants (high fixation index, FST) and those from larger haplotypes (high LD score). However, the gain in imputation accuracy over TOPMed-R3 was largest for small haplotypes (low LD score), reflecting the Samoan panel's ability to capture population-specific variation not well tagged by other panels. We also augmented the 1KGP reference panel with varying numbers of Samoan participants and found that panels with 24 Samoans yielded similar performance to TOPMed-R3, and panels with 48 or more Samoans included outperformed TOPMed-R3 for all variants with MAF ≥ 0.001. Meta-imputation of the TOPMed-R3 and 1285 Samoan panels yielded poorer performance than the Samoan-only panel.
We also demonstrated that the phasing of the reference panel impacts the imputation of population-specific variants when the reference panel is composed of individuals from an isolated population and not combined with ancestrally diverse haplotypes. This study identifies variants with improved imputation using population-specific reference panels and provides a framework for constructing other population-specific reference panels.
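
The "well-imputed" criterion in the abstract above is a threshold on a per-variant r². As a rough illustration only, the sketch below counts variants that pass a 0.80 cutoff, assuming r² is the squared Pearson correlation between known genotypes (0, 1, 2) and imputed dosages. The abstract does not say how r² was obtained (imputation software often reports an estimated r² instead), and the function names and toy data here are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def count_well_imputed(truth, dosage, threshold=0.80):
    """Count variants whose r^2 between true genotypes and imputed dosages meets the threshold."""
    return sum(
        1 for variant in truth
        if pearson_r2(truth[variant], dosage[variant]) >= threshold
    )


# Toy data (hypothetical): two variants typed in six individuals.
truth = {
    "var_a": [0, 1, 2, 0, 1, 2],
    "var_b": [0, 1, 2, 0, 1, 2],
}
dosage = {
    "var_a": [0.1, 0.9, 1.9, 0.0, 1.1, 2.0],  # dosages track the truth closely
    "var_b": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # uninformative, constant dosage
}

n_well = count_well_imputed(truth, dosage)
print(n_well)
```

Comparing two reference panels under this assumption would mean running the same count on each panel's dosages for the same set of held-out individuals.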

3

Microsatellites used in forensics are located in regions unusually rich in trait-associated variants

Link, V.; Zavaleta, Y. J. A.; Reyes, R.-J.; Ding, L.; Wang, J.; Rohlfs, R. V.; Edge, M. D.

2023-03-10 genetics 10.1101/2023.03.07.531629 medRxiv

Top 0.1%

17.3%

The 20 short tandem repeat (STR) markers of the combined DNA index system (CODIS) are the basis of the vast majority of forensic genetics in the United States. One argument for permissive rules about the collection of CODIS genotypes is that the CODIS markers are thought to contain information relevant to identification only (such as a human fingerprint would), with little information about ancestry or traits. However, in the past 20 years, a quickly growing field has identified hundreds of thousands of genotype-trait associations. Here we conduct a survey of the landscape of such associations surrounding the CODIS loci as compared with non-CODIS STRs. We find that the regions around the CODIS markers are enriched for both known pathogenic variants (>90th percentile) and for SNPs identified as trait-associated in genome-wide association studies (GWAS) (≥95th percentile in 10kb and 100kb flanking regions), compared with other random sets of autosomal tetranucleotide-repeat STRs. Although it is not obvious how much phenotypic information CODIS would need to convey to strain the "DNA fingerprint" analogy, the CODIS markers, considered as a set, are in regions unusually dense with variants with known phenotypic associations.

4

Improving GWAS performance in underrepresented groups by appropriate modeling of genetics, environment, and sociocultural factors

Cataldo-Ramirez, C.; Lin, M.; McMahon, A.; Gignoux, C.; Weaver, T. D.; Henn, B. M.

2026-04-08 genetics 10.1101/2024.10.28.620716 medRxiv

Top 0.1%

16.7%

Genome-wide association studies (GWAS) and polygenic score (PGS) development are typically constrained by the data available in biobank repositories in which European cohorts are vastly overrepresented. Here, we increase the utility of non-European participant data within the UK Biobank (UKB) by characterizing the genetic affinities of UKB participants who self-identify as Bangladeshi, Indian, Pakistani, "White and Asian" (WA), and "Any Other Asian" (AOA), towards creating a more robust South Asian sample size for future genetic analyses. We assess the relationships between genetic structure and self-selected ethnic identities and use consistent patterns of clustering in the dataset to train a support vector machine (SVM). The SVM was utilized to reassign n = 1,853 AOA and WA participants at the subcontinental level, and increase the sample size of the UKB South Asian group by 1,381 additional participants. We further leverage these samples to assess GWAS performance and PGS development. We include environmental covariates in the height GWAS by implementing a rigorous covariate selection procedure, and compare the outputs of two GWAS models: GWASnull and GWASenv. We show that PGS derived from both GWAS models yield prediction comparable to that of PGS models developed with training samples an order of magnitude larger, and that environmentally adjusted PGS models reduce the sex bias in predictive performance. In summary, we demonstrate how GWAS performance can be improved by leveraging ambiguous ethnicity codes, using ancestry-matched imputation panels, and including environmental covariates.

5

Non-coding genetic variants underlying higher prostate cancer risk in men of African ancestry

Li, S.; Fatema, K.; Nidharshan, S.; Singh, A.; Rajagopal, P. S.; Notani, D.; Takeda, D.; Hannenhalli, S.

2024-11-15 genetic and genomic medicine 10.1101/2024.11.14.24317278 medRxiv

Top 0.1%

14.2%

Incidence and severity of prostate cancer (PrCa) vary substantially across ancestries. American men of African ancestry (AA) are more likely to be diagnosed with and die from PrCa than those of European ancestry (EA). Published polygenic risk scores for developing prostate cancer, even those based on multi-ancestry genome-wide association studies, do not address population-specific genetic mechanisms underlying PrCa risk in men of African ancestry. Specifically, the role of non-coding regulatory polymorphisms in driving inter-ancestry variation in PrCa has not been sufficiently explored. Here, by employing a sequence-based deep learning model of prostate regulatory enhancers, we identified ~2,000 SNPs with higher alternate allele frequency in AA men that potentially affect enhancer function associated with PrCa susceptibility, as supported by our experimental validation. The identified enhancer SNPs (eSNPs) may influence PrCa development through two complementary mechanisms: 1) the alternate alleles that increase enhancer activity result in immune suppression and telomere elongation, and 2) the alternate alleles that decrease enhancer activity lead to de-differentiation and inhibition of apoptosis. Notably, the eSNPs tend to disrupt the binding of known prostate transcription factors including FOX, AR and HOX families. Lastly, the identified eSNPs can be combined into a polygenic risk score that adds value to current GWAS-based risk variants in assessing PrCa risk in independent cohorts.

6

Using HiFi Long-Read Whole Genome Sequencing To Enhance Diagnosis In Patients With Subfertility And/Or Recurrent Pregnancy Loss

Teo, J. X.; Cheawsamoot, C.; Kim, D.; Goh, J. C.-Y.; Kam, S.; Chan, S. S.-M.; Yang, L.; Liu, S.; Chua, K. P.; Cheng, W.; Ma, G.-C.; Chang, T.-Y.; Lin, Y.-S.; Wu, K.-M.; Yu, E. J.; Kim, Y.; Seong, M.-W.; Thuwanut, P.; Tuntiviriyapun, P.; Suebthawinkul, C.; Srichomthong, C.; Chetruengchai, W.; Kanlayaprasit, S.; Wongong, R.; Korlach, J.; Lee, J.-S.; Chen, M.; Hwang, S.; Lim, W. K.; Shotelersuk, V.; Jamuar, S. S.

2026-05-08 sexual and reproductive health 10.64898/2026.05.01.26352136 medRxiv

Top 0.1%

14.1%

Subfertility and recurrent pregnancy loss (RPL) affect a significant proportion of couples worldwide. Genetic causes can be seen in up to 30% of these individuals but require multiple genetic tests, which often impedes a comprehensive workup. Newer genomic technologies, such as PacBio HiFi long-read sequencing (LRS), can detect most subclasses of variation (such as structural rearrangements and monogenic disorders) through a single test. In this multicenter study, we enrolled couples with unexplained subfertility and/or RPL and performed HiFi LRS to determine the underlying genetic etiology. Participants were recruited using standardized inclusion/exclusion criteria to rule out other known causes of subfertility and/or RPL. Ninety-six individuals were recruited across the 5 sites. Average age of participants was 36 years (range 30-46 years). Among the 84 individuals who completed sequencing, 4.8% were identified with a likely genetic diagnosis and variants of uncertain significance were identified in another 14.2% of individuals. One individual was identified with an ACMG secondary finding, and while multiple carriers for recessive genetic disorders were identified, none of the couples were identified to be at increased risk. This study highlights the utility of performing genomic sequencing in couples with unexplained subfertility and/or RPL, with 1 in 10 couples harboring a clinically significant variant. In addition, use of HiFi LRS allowed for characterization of different subclasses of genomic variations through a single test. Future studies, including exploring the cost effectiveness and resource utilization of LRS as a first-line test, will help in optimizing care for such couples.

Tweetable statement: A single long-read genome sequencing test can consolidate multiple genetic investigations and uncover clinically relevant causes in couples with unexplained subfertility and recurrent pregnancy loss.

At a glance

Why was this study conducted?
- Many couples with subfertility and recurrent pregnancy loss remain undiagnosed after multiple conventional genetic tests
- Existing workflows require sequential testing and may miss complex genomic variants

What are the key findings?
- Long-read genome sequencing identified clinically relevant variants in ~1 in 10 couples with unexplained subfertility or recurrent pregnancy loss
- A single assay enabled detection of multiple variant types, including structural and sequence variants

What does this study add to what is already known?
- Demonstrates feasibility of a unified genomic testing approach in a real-world multicenter cohort
- Supports a potential shift from fragmented testing toward a single comprehensive genomic workflow

7

In vivo versus in silico assessment of potentially pathogenic missense variants in human reproductive genes

Ding, X.; Singh, P.; Tran, T. N.; Fragoza, R.; Yu, H.; Schimenti, J. C.

2021-10-12 genetics 10.1101/2021.10.12.464112 medRxiv

Top 0.1%

14.1%

Infertility is a heterogeneous condition, with genetic causes estimated to be involved in approximately half of the cases. High-throughput sequencing (HTS) is becoming an increasingly important tool for genetic diagnosis of diseases including idiopathic infertility; however, most rare or minor alleles revealed by HTS are variants of uncertain significance (VUS). Interpreting the functional impacts of VUS is challenging but profoundly important for clinical management and genetic counseling. To determine the consequences of population polymorphisms in key fertility genes, we functionally evaluated 11 missense variants in the genes ANKRD31, BRDT, DMC1, EXO1, FKBP6, MCM9, M1AP, MEI1, MSH4 and SEPT12 by generating genome-edited mouse models. Nine variants were classified as deleterious by most functional prediction algorithms, and two disrupted a protein-protein interaction in the yeast two-hybrid assay. Even though these genes are known to be essential for normal meiosis or spermiogenesis in mice, only one of the tested human variants (rs1460351219, encoding p.R581H in MCM9), which was observed in a male infertility patient, compromised fertility or gametogenesis in the mouse models. To explore the disconnect between predictions and outcomes, we compared pathogenicity calls of missense variants made by ten widely used algorithms to: 1) those present in ClinVar, and 2) those which have been evaluated in mice. We found that all the algorithms performed poorly in terms of predicting the effects of human missense variants that have been modeled in mice. These studies emphasize caution in the genetic diagnoses of infertile patients based primarily on pathogenicity prediction algorithms, and emphasize the need for alternative and efficient in vitro or in vivo functional validation models for more effective and accurate VUS delineation to either pathogenic or benign categories.
Significance: Although infertility is a substantial medical problem that affects up to 15% of couples, the potential genetic causes of idiopathic infertility have been difficult to decipher. This problem is complicated by the large number of genes that can cause infertility when perturbed, coupled with the large number of VUS that are present in the genomes of affected patients. Here, we present and analyze mouse modeling data of missense variants that are classified as deleterious by commonly used pathogenicity prediction algorithms but which caused no detectable phenotype when introduced into mice by genome editing. We find that augmenting pathogenicity predictions with preliminary screens for biochemical defects substantially enhanced the proportion of prioritized variants that caused phenotypes in mice. The results emphasize that, in the absence of substantial improvements of in silico prediction tools or other compelling pre-existing evidence, in vivo analysis is crucial for confident attribution of infertility alleles.

8

Explicitly modeling genetic ancestry to improve polygenic prediction accuracy for height in a large, admixed cohort of US Latinos: Findings from HCHS/SOL

Wang, X.; Sofer, T.; Frei, O.; Kaplan, R.; Perreira, K. M.; Franceschini, N.; Parada, H.; Zhou, L.; Andreassen, O. A.; Gonzalez, H.; Dale, A. M.; Broce, I. J.

2025-03-23 genetic and genomic medicine 10.1101/2025.03.21.25324423 medRxiv

Top 0.1%

12.8%

Polygenic scores (PGS) offer moderate to high prediction accuracy for complex traits, but most are developed in European ancestry cohorts, reducing their performance in populations of other ancestries. This study aimed to improve standing height prediction, a heritable and ancestry-influenced trait, in an admixed Latino cohort (HCHS/SOL) by modeling ancestry using principal components (PCs) alongside PGS. SNPs were selected from a large European ancestry GWAS using various p-value thresholds, and weights were trained using traditional and penalized regression in the UK Biobank (UKB). PGS with PCs were trained separately in HCHS/SOL and UKB. Compared to PGS alone, modeling PGS with PCs substantially improved height prediction in HCHS/SOL (R² increase of ~0.1), while mild improvements were observed in UKB (R² increase of ~0.01). These results underscore the importance of incorporating genetic ancestry into predictive models for admixed populations, particularly when the trait exhibits ancestry-specific associations.
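
The comparison in the abstract above (PGS alone versus PGS plus ancestry PCs) can be read as a difference in R² between two nested linear models. The sketch below is a toy illustration on simulated data, not the authors' pipeline: the effect sizes, the sample size, the use of two PCs, and the in-sample R² are all assumptions made here for demonstration, and the helper names are hypothetical.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def r_squared(columns, y):
    """In-sample R^2 of an ordinary least squares fit of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    mean_y = sum(y) / n
    ss_res = sum((y[i] - sum(beta[a] * X[i][a] for a in range(p))) ** 2 for i in range(n))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - ss_res / ss_tot


# Simulated cohort (assumed values): height depends on the PGS and on PC1.
rng = random.Random(0)
n = 500
pgs = [rng.gauss(0, 1) for _ in range(n)]
pc1 = [rng.gauss(0, 1) for _ in range(n)]
pc2 = [rng.gauss(0, 1) for _ in range(n)]
height = [0.5 * pgs[i] + 0.4 * pc1[i] + rng.gauss(0, 1) for i in range(n)]

r2_pgs = r_squared([pgs], height)                 # PGS alone
r2_pgs_pcs = r_squared([pgs, pc1, pc2], height)   # PGS plus ancestry PCs
delta_r2 = r2_pgs_pcs - r2_pgs
print(round(r2_pgs, 3), round(r2_pgs_pcs, 3), round(delta_r2, 3))
```

Because the models are nested, the in-sample R² of the larger model can never be lower than that of the smaller one; an evaluation on held-out individuals would be needed to rule out overfitting, and the abstract does not state which was used.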

9

Inactivating PLEKHA6 Mutations Cause Idiopathic Hypogonadotropic Hypogonadism Through Impaired Kisspeptin Secretion

Topaloglu, A. K.; Plummer, L.; Su, C.-W.; Kotan, L. D.; Celmeli, G.; Simsek, E.; Zhao, Y.; Stamou, M.; Anik, A.; Döger, E.; Altıncık, S. A.; Mengen, E.; Koc, A. F.; Akkus, G.; Balasubramanian, R.; Turan, I.; Seminara, S. B.; Yuksel, B.

2026-04-13 pediatrics 10.64898/2026.04.10.26349358 medRxiv

Top 0.1%

12.8%

Purpose: Idiopathic hypogonadotropic hypogonadism (IHH) is characterized by impaired reproductive maturation, and approximately half of all cases lack an identified genetic cause. We investigated the genetic basis of IHH in two large cohorts to identify novel disease-causing genes. Methods: We analyzed exome and genome sequencing data from 1,822 patients with IHH from two independent cohorts. Rare variants were filtered using pedigree-informed inheritance models. PLEKHA6 expression in the postmortem human hypothalamus was tested at the mRNA and protein level. Functional studies assessed kisspeptin secretion in cell-based assays. Results: We identified 18 distinct PLEKHA6 variants in 24 patients from 20 unrelated families (1.3% of cohort). Variants segregated with disease under autosomal recessive and autosomal dominant (with variable penetrance) inheritance patterns. PLEKHA6 was robustly expressed in the hypothalamus and showed clear colocalization with neurokinin B, which served as the marker for the GnRH pulse generator. Functional studies demonstrated that patient variants significantly impaired kisspeptin secretion. Conclusion: PLEKHA6 is a novel IHH gene and the first reported regulator of kisspeptin secretion from the kisspeptin-neurokinin B-dynorphin (KNDy) neurons, which have recently been established as the GnRH pulse generator. These findings establish impaired kisspeptin release as a new disease mechanism in IHH and highlight the critical role of neuropeptide trafficking in reproductive function.

10

Integrative approaches to improve the informativeness of deep learning models for human complex diseases

Dey, K. K.; Kim, S. S.; Gazal, S.; Nasser, J.; Engreitz, J. M.; Price, A.

2020-09-09 genetics 10.1101/2020.09.08.288563 medRxiv

Top 0.1%

12.0%

Deep learning models have achieved great success in predicting genome-wide regulatory effects from DNA sequence, but recent work has reported that SNP annotations derived from these predictions contribute limited unique information for human complex disease. Here, we explore three integrative approaches to improve the disease informativeness of allelic-effect annotations (predicted difference between reference and variant alleles) constructed using several previously trained deep learning models: DeepSEA, Basenji and DeepBind (and a related machine learning model, deltaSVM). First, we employ gradient boosting to learn optimal combinations of deep learning annotations, using fine-mapped SNPs and matched control SNPs (on held-out chromosomes) for training. Second, we improve the specificity of these annotations by restricting them to SNPs implicated by (proximal and distal) SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies -- generalizing the previously proposed ExPecto approach, which incorporates deep learning annotations based on distance to TSS. We evaluated these approaches using stratified LD score regression, using functional data in blood and focusing on 11 autoimmune diseases and blood-related traits (average N =306K). We determined that the three approaches produced SNP annotations that were uniquely informative for these diseases/traits, despite the fact that linear combinations of the underlying DeepSEA, Basenji, DeepBind and deltaSVM blood annotations were not uniquely informative for these diseases/traits. Our results highlight the benefits of integrating SNP annotations produced by deep learning models with other types of data, including data linking SNPs to genes.

11

Evaluating the Contribution of Genome 3D Folding to Variation in Human Height Using Machine Learning

Gu, W.; Gilbertson, E.; Baranzini, S. E.; Salem, R.; Capra, J. A.

2025-09-15 genetics 10.1101/2025.09.09.675195 medRxiv

Top 0.1%

11.7%

Genome-wide association studies (GWAS) have identified thousands of variants associated with complex traits, yet the majority lie in noncoding regions, making it difficult to determine their functional impact. Alterations to the three-dimensional (3D) spatial interactions among gene regulatory elements are increasingly recognized as a mechanism by which genetic variants influence gene expression. However, experimentally evaluating whether variants disrupt 3D-genome structure is not feasible at GWAS scale. To address this, we developed a computational framework that integrates GWAS summary statistics with predictions from the Akita sequence-based deep learning model of 3D chromatin contacts. We applied the framework to 9,917 genomic regions associated with human height, assessing both individual variants and haplotypes for their predicted impact on 3D genome architecture. Only a small fraction of height-associated haplotypes had substantial predicted disruption of 3D folding (17 regions, 0.17%, exceeded a disruption score of 0.1). Considering all common variants in a haplotype together generally produced greater perturbations than individual variants, but several highly divergent regions were driven by single variants. We highlight a variant that disrupts the binding motif at a confirmed CTCF binding site and is predicted to modify 3D genome contacts with the LCOR promoter, suggesting that 3D-genome-mediated disruption of gene regulation underlies the association with height. This work presents a scalable and interpretable strategy for integrating 3D genome modeling with GWAS, enabling investigation of this important regulatory mechanism in the connection of non-coding genetic variation to complex traits.

12

The accuracy of polygenic score models for anthropometric traits and Type II Diabetes in the Native Hawaiian Population

Lo, Y.-C.; Chan, T. F.; Jeon, S.; Maskarinec, G.; Taparra, K.; Nakatsuka, N.; Yu, M.; Chen, C.-Y.; Lin, Y.-F.; Wilkens, L. R.; Le Marchand, L.; Haiman, C. A.; Chiang, C. W. K.

2023-12-28 genetic and genomic medicine 10.1101/2023.12.25.23300499 medRxiv

Top 0.1%

10.1%

Polygenic scores (PGS) are promising in stratifying individuals based on the genetic susceptibility to complex diseases or traits. However, PGS models, typically trained in European- or East Asian-ancestry populations, tend to perform poorly in other ethnic minority populations, and their accuracies have not been evaluated for Native Hawaiians. Using body mass index, height, and type-2 diabetes as examples of highly polygenic traits, we evaluated the prediction accuracies of PGS models in a large Native Hawaiian sample from the Multiethnic Cohort with up to 5,300 individuals. We evaluated both publicly available PGS models and genome-wide PGS models trained in this study using the largest available GWAS. We found evidence of lowered prediction accuracies for the PGS models in some cases, particularly for height. We also found that using the Native Hawaiian samples as an optimization cohort during training did not consistently improve PGS performance. Moreover, even the best performing PGS models among Native Hawaiians would have lowered prediction accuracy among the subset of individuals most enriched with Polynesian ancestry. Our findings indicate that factors such as admixture histories, sample size and diversity in GWAS can influence PGS performance for complex traits among Native Hawaiian samples. This study provides an initial survey of PGS performance among Native Hawaiians and exposes the current gaps and challenges associated with improving polygenic prediction models for underrepresented minority populations.

13

Quantifying factors that affect polygenic risk score performance across diverse ancestries and age groups for body mass index

Hui, D.; Xiao, B.; Dikilitas, O.; Freimuth, R. R.; Irvin, M. R.; Jarvik, G. P.; Kottyan, L.; Kullo, I.; Limdi, N. A.; Liu, C.; Luo, Y.; Namjou, B.; Puckelwartz, M. J.; Schaid, D.; Tiwari, H.; Wei, W.-Q.; Verma, S. S.; Kim, D.; Ritchie, M. D.

2022-05-28 genetic and genomic medicine 10.1101/2022.05.27.22275647 medRxiv

Top 0.1%

9.9%

Polygenic risk scores (PRS) have led to enthusiasm for precision medicine. However, it is well documented that PRS do not generalize across groups differing in ancestry or sample characteristics (e.g., age). Quantifying performance of PRS across different groups of study participants, using genome-wide association study (GWAS) summary statistics from multiple ancestry groups and sample sizes, and using different linkage disequilibrium (LD) reference panels may clarify factors limiting PRS transferability. To evaluate these factors in the PRS generation process, we generated body mass index (BMI) PRS (PRSBMI) in the Electronic Medical Records and Genomics network (N=75,661). Analyses were conducted in two ancestry groups (European and African) and three age ranges (adults, teenagers, and children). For PRSBMI calculations, we evaluated five LD reference panels and three GWAS summary statistics of varying sample size and ancestry. PRSBMI performance increased for both African and European ancestry individuals using cross-ancestry GWAS summary statistics compared to European-only summary statistics (6.3% and 3.7% relative R² increase, respectively; p = 0.038 for African and p = 6.26×10⁻⁴ for European ancestry). The effects of LD reference panels were more pronounced in African ancestry study datasets. PRSBMI performance degraded in children; R² was less than half that of teenagers or adults. The effect of GWAS summary statistics sample size was small when modeled with the other factors. We also explored clinical comorbidities associated with the PRSBMI and identified associations with type 2 diabetes and coronary atherosclerosis. This study quantifies effects that ancestry, GWAS summary statistic sample size, and LD reference panel have on PRS performance, especially in cross-ancestry and age-specific analyses.

14

Population specific reference panels are crucial for the genetic analyses of Native Hawai’ians: an example of the CREBRF locus

Lin, M.; Caberto, C.; Wan, P.; Li, Y.; Lum-Jones, A.; Tiirikainen, M.; Pooler, L.; Nakamura, B.; Sheng, X.; Porcel, J.; Lim, U.; Setiawa, V. W.; Le Marchand, L.; Wilkens, L. R.; Haiman, C. A.; Cheng, I.; Chiang, C. W. K.

2019-10-01 genetics 10.1101/789073 medRxiv

Top 0.1%

9.9%

Statistical imputation applied to genome-wide array data is the most cost-effective approach to complete the catalog of genetic variation in a study population. However, imputed genotypes in underrepresented populations incur greater inaccuracies due to ascertainment bias and a lack of representation among reference individuals, further contributing to the obstacles to studying these populations. Here we examined the consequences of this lack of representation by genotyping a functionally important, Polynesian-specific variant, rs373863828, in the CREBRF gene, in a large number of self-reported Native Hawaiians (N=3,693) from the Multiethnic Cohort. We found the derived allele of rs373863828 was significantly associated with several adiposity traits with large effects (e.g. 0.214 s.d., or approximately 1.28 kg/m², per allele, in BMI as the most significant; P = 7.5×10⁻⁵). Due to the current absence of Polynesian representation in publicly accessible reference sequences, rs373863828 or any of its proxies could not be tested through imputation using these existing resources. Moreover, the association signals at this Polynesian-specific variant could not be captured by alternative approaches, such as admixture mapping. In contrast, highly accurate imputation can be achieved even if only a small number (<200) of Polynesian reference individuals are available. By constructing an internal set of Polynesian reference individuals, we were able to increase sample size for analysis up to 3,936 individuals, and improved the statistical evidence of association (e.g. p = 1.5×10⁻⁷, 3×10⁻⁶, and 1.4×10⁻⁴ for BMI, hip circumference, and T2D, respectively). Taken together, our results suggest the alarming possibility that lack of representation in reference panels would inhibit discovery of functionally important, population-specific loci such as CREBRF. Yet, they could be easily detected and prioritized with improved representation of diverse populations in sequencing studies.

15

Systematic comparison of phenome-wide admixture mapping and genome-wide association in a diverse biobank

Cullina, S.; Shemirani, R.; Asgari, S.; Kenny, E. E.

2024-11-18 genetic and genomic medicine 10.1101/2024.11.18.24317494 medRxiv

Top 0.1%

9.8%

Hispanic/Latino(a) (HL) and African American (AA) populations remain underrepresented in biobank-scale association studies, limiting the discovery of disease-associated genetic factors in these groups. We present here a systematic comparison of phenome-wide admixture mapping (AM) and genome-wide association (GWAS) using data from the diverse BioMe biobank in New York City. Our analysis highlights 77 genome-wide significant AM signals, 48 of which were not detected by GWAS, emphasizing the complementary nature of these two approaches. AM-tagged variants show significantly higher minor allele frequency and population differentiation (Fst) while GWAS demonstrated higher odds ratios, underscoring the distinct genetic architecture identified by each method. This study offers a comprehensive phenome-wide AM resource, demonstrating its utility in uncovering novel genetic associations in underrepresented populations, particularly for variants missed by traditional GWAS approaches.

16

Differential performance of polygenic prediction across traits and populations depending on genotype discovery approach

Lin, Y.-S.; Tan, T.; Wang, Y.; Pasaniuc, B.; Martin, A.; Atkinson, E. G.

2025-03-18 genetics 10.1101/2025.03.18.644029 medRxiv

Top 0.1%

9.8%

Polygenic scores (PGS) are widely used to estimate genetic predisposition to complex traits by aggregating the effects of common variants into a single measure. They hold promise in identifying individuals at increased risk for diseases, allowing earlier screening and interventions. Genotyping arrays, commonly used for PGS computation, are affordable and computationally efficient, while whole-genome sequencing (WGS) offers a more comprehensive view of genetic variation. In this study, we compared PGS derived from arrays and WGS across multiple traits to evaluate differences in predictive performance, portability across populations, and computational efficiency. We computed PGS for 10 traits, representing a range of heritability and polygenicity, in the three largest genetic ancestry groups in All of Us (European, African American, Admixed American), trained on multi-ancestry meta-analyses from the Pan-UK Biobank. Using the clumping and thresholding (C+T) method, we found that WGS-based PGS outperformed array-based PRS for highly polygenic traits but showed differentially reduced accuracy for sparse traits in certain populations. With the LD-informed PRS-CS method, we observed overall improved prediction performance compared to C+T, with WGS outperforming arrays across most non-cancer traits. The results obtained using PRS-CS closely align with those derived from pre-trained models in the PGS Catalog, with prediction achieving better performance using WGS than array genotypes for non-sparse traits. To further investigate factors influencing differential prediction performance between array and WGS, we ran simulations varying the proportions of causal SNPs directly captured by the technologies. These demonstrated that the proportion of causal variants genotyped dramatically affects prediction accuracy. Fine-mapping of empirical data supported this concept but also highlighted the importance of reducing non-informative variants for optimal prediction accuracy. 
In conclusion, while WGS-based PGS generally offer superior predictive power with PRS-CS, the advantage over arrays is context-dependent, varying by trait, population, and the PGS method. The ability to capture causal variants through these technologies largely drives the prediction accuracy. This study provides insights into the complexities and potential advantages of using different genotype discovery approaches for polygenic predictions across populations and informs on strategies to enhance accuracy.
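
As a rough illustration of what the abstract means by aggregating variant effects into a single score, the sketch below computes a polygenic score as a weighted sum of allele dosages, keeping only variants that pass a p-value threshold. It is a minimal sketch under stated assumptions: the function name `polygenic_score` and the toy numbers are hypothetical, and the LD clumping step of C+T is omitted (the variants are assumed to be already clumped). It is not the pipeline used in the study.

```python
def polygenic_score(dosages, effect_sizes, p_values, p_threshold):
    """Weighted sum of allele dosages over variants with p <= p_threshold.

    Hypothetical helper for illustration only: it covers the thresholding
    half of clumping and thresholding and assumes LD clumping was done
    beforehand.
    """
    if not (len(dosages) == len(effect_sizes) == len(p_values)):
        raise ValueError("all inputs must have one entry per variant")
    return sum(
        dosage * beta
        for dosage, beta, p in zip(dosages, effect_sizes, p_values)
        if p <= p_threshold
    )


# Toy data for one individual: effect-allele counts (0, 1, or 2) at three variants.
dosages = [0, 1, 2]
effect_sizes = [0.1, -0.2, 0.3]   # per-allele effects from a discovery GWAS
p_values = [1e-8, 0.5, 1e-3]      # discovery p-values for the same variants

strict_score = polygenic_score(dosages, effect_sizes, p_values, p_threshold=1e-2)
lenient_score = polygenic_score(dosages, effect_sizes, p_values, p_threshold=1.0)
```

Under this view, a variant that is not genotyped simply contributes nothing to the sum, which is one way to read the abstract's point that the proportion of causal variants captured affects prediction accuracy.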

17

The NR5A1/SF-1 variant p.Gly146Ala cannot explain the phenotype of individuals with a difference of sex development.

Martinez de Lapiscina, I.; Kouri, C.; Aurrekoetxea, J.; Sanchez, M.; Naamneh Elzenaty, R.; Sauter, K. S.; Camats, N.; Grau, G.; Rica, I.; Rodriguez, A.; Vela, A.; Cortazar, A.; Alonso-Cerezo, M. C.; Bahillo, P.; Berthod, L.; Esteva, I.; Castano, L.; Flueck, C. E.

2023-02-17 pediatrics 10.1101/2023.02.13.23285760 medRxiv

Top 0.1%

9.3%

Steroidogenic factor 1 (SF-1, NR5A1) plays an important role in human sex development. Variants of NR5A1/SF-1 may cause mild to severe differences of sex development (DSD) or may be found in healthy carriers. So far, the broad DSD phenotypic variability associated with NR5A1/SF-1 variants remains a conundrum. The NR5A1/SF-1 variant c.437G>C/p.Gly146Ala is common in individuals with a DSD and has been suggested to act as a susceptibility factor for adrenal disease or cryptorchidism. However, as the allele frequency in the general population is high, and as functional testing of the p.Gly146Ala variant in vitro revealed inconclusive results, the disease-causing effect of this variant has been questioned. Nevertheless, a role as a disease modifier in concert with other gene variants is still possible given that oligogenic inheritance has been described in patients with NR5A1/SF-1 gene variants. Therefore, we performed next-generation sequencing in DSD individuals harboring the NR5A1/SF-1 p.Gly146Ala variant to search for other DSD-causing variants. The aim was to clarify the function of this variant for the phenotype of the carriers. We studied 14 pediatric DSD individuals who carried the p.Gly146Ala variant. Panel and whole-exome sequencing were performed, and data were analyzed with a specific data filtering algorithm for detecting variants in NR5A1- and DSD-related genes. The phenotype of the studied individuals ranged from scrotal hypospadias and ambiguous genitalia in 46,XY DSD to typical male external genitalia and ovotestes in 46,XX DSD patients. Patients were of African, Spanish, and Asian origin. Of the 14 studied subjects, five were homozygous and nine heterozygous for the NR5A1/SF-1 p.Gly146Ala variant. In ten subjects we identified either a clearly pathogenic DSD gene variant (e.g. in AR, LHCGR) or one to four potentially deleterious variants that likely explain the observed phenotype alone (e.g. in FGFR3, CHD7, ADAMTS16).
Our study shows that most individuals carrying the NR5A1/SF-1 p.Gly146Ala variant harbor at least one other deleterious gene variant which can explain the DSD phenotype. This finding confirms that the p.Gly146Ala variant of NR5A1/SF-1 may not contribute to the pathogenesis of DSD and qualifies as a benign polymorphism. Thus, individuals in whom the NR5A1/SF-1 p.Gly146Ala gene variant has been identified in the past as the underlying genetic cause for their DSD should be re-evaluated with a next-generation sequencing method to reveal the real genetic diagnosis.

18

A simple approach for multiple observations improves power to detect genetic effects and genomic prediction accuracy.

Evans, L. M.; Arehart, C. H.; Gibson, R. A.; Bowman, G. I.; Gignoux, C.

2025-09-21 genetic and genomic medicine 10.1101/2025.09.19.25336197 medRxiv

Top 0.1%

8.6%

Many datasets, including widely used biobanks, have more than one observation of numerous phenotypes for at least a portion of their sample. The majority of GWAS utilize only a single observation per individual, even when more than one observation may be available, and apply a standard model in which the additive allelic effect being estimated is assumed to be constant across the age or time range in the sample. Here, we test a set of simple approaches to utilize multiple observations per individual, under this same assumption. We find that utilizing the mean or median of the available observations rather than a single observation improves power to detect associated loci and enriched gene sets and yields higher out-of-sample polygenic score prediction accuracy. Despite growing biobanks, many deeply phenotyped samples are relatively small but have multiple observations. While explicitly modeling age- or time-dependent genetic effects can estimate time- or age-specific genetic effects, most GWAS apply a standard, additive-only model; a simple approach of using the mean or median can improve power by reducing "noise" in the phenotype, utilize standard, optimized software, and be particularly impactful for smaller samples, including samples of diverse genetic ancestry currently existing in widely used biobanks.
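
The noise-reduction argument in this abstract can be sketched with a small simulation. The example below is only an illustration under simple assumptions not taken from the preprint: each individual has a fixed underlying trait value, every observation adds independent Gaussian measurement noise, and the function names and parameter values are hypothetical. It compares how well a single observation, the mean, and the median of repeated observations track the underlying value.

```python
import random
import statistics


def simulate_observations(n_individuals, n_obs, noise_sd, seed=0):
    """Return (true_values, observations) under an assumed additive-noise model."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n_individuals)]
    observations = [
        [value + rng.gauss(0.0, noise_sd) for _ in range(n_obs)]
        for value in true_values
    ]
    return true_values, observations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


true_values, observations = simulate_observations(
    n_individuals=2000, n_obs=5, noise_sd=1.0
)

single = [obs[0] for obs in observations]                   # one observation per person
mean_obs = [statistics.fmean(obs) for obs in observations]  # mean of all observations
median_obs = [statistics.median(obs) for obs in observations]

r_single = pearson(true_values, single)
r_mean = pearson(true_values, mean_obs)
r_median = pearson(true_values, median_obs)

print(f"single: {r_single:.3f}  mean: {r_mean:.3f}  median: {r_median:.3f}")
```

Under this model, averaging k independent observations divides the noise variance by k, so the summarized phenotype correlates more strongly with the underlying value than any single observation does. That is the sense in which a cleaner phenotype can raise power without changing the standard additive model.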

19

New Genetic Insights in Rheumatoid Arthritis using Taxonomy3(R), a Novel method for Analysing Human Genetic Data

Kozlowska, J.; Humphryes-Kirilov, N.; Pavlovets, A.; Connolly, M.; Kuncheva, Z.; Horner, J.; Sousa Manso, A.; Murray, C.; Fox, J. C.; McCarthy, A.

2023-02-24 rheumatology 10.1101/2023.02.21.23286176 medRxiv

Top 0.1%

8.5%

Genetic support for a drug target has been shown to increase the probability of success in drug development, with the potential to reduce attrition in the pharmaceutical industry alongside discovering novel therapeutic targets. It is therefore important to maximise the detection of genetic associations that affect disease susceptibility. Conventional statistical methods used to analyse genome-wide association studies (GWAS) only identify some of the genetic contribution to disease, so novel analytical approaches are required to extract additional insights. C4X Discovery has developed a new method, Taxonomy3(R), for analysing genetic datasets based on novel mathematics. When applied to a previously published rheumatoid arthritis GWAS dataset, Taxonomy3(R) identified many additional novel genetic signals associated with this autoimmune disease. Follow-up studies using tool compounds support the utility of the method in identifying novel biology and tractable drug targets with genetic support for further investigation.

20

Functional and Computational Interrogation of the Juvenile Idiopathic Arthritis Risk Loci Identifies Candidate Causal SNPs and Target Genes in CD4+ T cells

Jiang, K.; Haley, E. K.; Barshad, G.; He, A.; Rogic, A.; Rice, E. J.; Sudman, M.; Thompson, S. D.; Danko, C. G.; Jarvis, J. N.

2025-12-16 genetic and genomic medicine 10.64898/2025.12.15.25342296 medRxiv

Top 0.1%

8.5%

GWAS have identified multiple genetic regions that confer risk for juvenile idiopathic arthritis (JIA). However, identifying the single nucleotide polymorphisms (SNPs) that drive disease risk has been impeded by the fact that the SNPs used to identify risk loci are in linkage disequilibrium (LD) with hundreds of other SNPs. Since the causal SNPs remain unknown, it is difficult to identify target genes and thus use genetic information to elucidate disease biology and inform patient care. We used existing genotyping data from 3,939 children with JIA and 14,412 healthy controls to identify SNPs on JIA risk haplotypes that are present within open chromatin in multiple immune cell types and are more common in children with JIA than in controls (p<0.05) in the genotyping data sets. We identified SNPs within cis-regulatory regions (CREs) using precision run-on sequencing (PROseq) data, and identified likely target genes using MicroC in both resting and activated CD4+ T cells. We identified 138 SNPs within the PROseq-identified CREs, and 41 genes with which these CREs physically interacted. Data from GTEx corroborated these analyses by showing allelic effects for SNPs within the CREs in the ERAP2 and IRF1 risk loci. We further corroborated IRF1 allelic effects using a luciferase reporter assay. Our findings significantly reduce the genomic search space for risk-driving variants and target genes and support the roles of IRF1, ERAP2 and LNPEP in driving risk for JIA.